Add LTX-2.3 text-to-video generation support #402
Conversation
🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
This Pull Request successfully introduces support for LTX-2.3 text-to-video generation. It includes significant updates to the transformer architecture (gated attention, cross-modal modulation) and the denoising pipeline (4-way batched denoising for STG/CFG/MIG). The implementation is high-quality and integrates well with the existing LTX-2 infrastructure.
🔍 General Feedback
- Redundant Patch File: The `scratch_diff.patch` file was likely added by mistake and should be removed before merging.
- Robustness: A few areas in the pipeline (like the `audio_channels` fallback and upsampler parameter inference) could be made more robust to handle different model versions and naming conventions.
- Optimization: The use of `nnx.jit` for the vocoder and the optimized sequence length in smoke tests are excellent additions for performance and stability.
🤖 Hi @prishajain1, I've received your request, and I'm working on it now! You can track my progress in the logs for more details. |
🤖 I'm sorry @prishajain1, but I was unable to process your request. Please see the logs for more details.
This Pull Request introduces comprehensive support for LTX-2.3 text-to-video generation, including the end-to-end pipeline, model updates, and a new vocoder with bandwidth extension (BWE). The implementation correctly handles complex features like Spatio-Temporal Guidance (STG) and Modality Isolation Guidance (MIG) using a 4-way batched denoising approach in JAX.
🔍 General Feedback
- STG/MIG Logic: The implementation of the 4-way split denoising logic and the corresponding delta formulations for guidance is impressive and aligns well with the LTX-2.3 technical requirements.
- Efficiency: Utilizing `nnx.scan` for the denoising loop ensures optimal performance on TPU/GPU hardware.
- Redundancy: I identified some redundant initializations and assignments in the transformer and autoencoder models that should be cleaned up.
- Parameter Initialization: Double-check the usage of `nnx.Param` with `kernel_init`, as `nnx.Param` typically only accepts the data tensor and might ignore additional keyword arguments.
Perseus14 left a comment
Left a few comments. PTAL
Additional Comments
- Could you test LTX2 in this branch and ensure that there is no regression?
- Please test with `scan_layers` true/false as well.
- Please add e2e generation time, as well as per-component timings, if possible.
This Pull Request introduces comprehensive support for the LTX-2.3 multi-modal (audio-video) transformer model. It includes key architectural updates such as Gated Cross-Modal Attention, Prompt AdaLN, and a sophisticated Bandwidth Extension (BWE) Vocoder. The implementation is technically sound, highly optimized for JAX/TPU, and follows the project's established modular patterns.
🔍 General Feedback
- 4-Way Batched Denoising: The integration of Spatiotemporal Guidance (STG) and Modality Isolation Guidance (MIG) via 4-way batching is a major highlight, enabling advanced generation features.
- Performance: Excellent use of JIT caching for the vocoder and conditional VAE replication to optimize inference latency.
- Code Quality: The transition to more explicit logic for guidance (using x0 space) improves both readability and correctness compared to standard velocity-based CFG.
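As a rough illustration of the 4-way batched guidance idea discussed above, here is a minimal sketch of recombining the four denoiser passes. The scale names and the delta formulation below are illustrative assumptions, not the exact LTX-2.3 equations or this PR's implementation:

```python
import jax.numpy as jnp


def combine_guidance(x0_batch, cfg_scale=7.0, stg_scale=1.0, mig_scale=1.0):
    """Combine a 4-way batched x0 prediction into one guided estimate.

    x0_batch stacks the four denoiser passes along the leading axis:
    [uncond, cond, perturb, isolated]. The delta formulation here is
    illustrative, not the exact LTX-2.3 guidance equations.
    """
    uncond, cond, perturb, isolated = jnp.split(x0_batch, 4, axis=0)
    guided = (
        uncond
        + cfg_scale * (cond - uncond)    # classifier-free guidance delta
        + stg_scale * (cond - perturb)   # spatiotemporal guidance delta
        + mig_scale * (cond - isolated)  # modality-isolation guidance delta
    )
    return guided


# batch shape: (4, 2, 3); each leading slice is one guidance branch
batch = jnp.stack([jnp.full((2, 3), v) for v in (0.0, 1.0, 0.5, 0.8)])
out = combine_guidance(batch)
print(out.shape)  # (1, 2, 3)
```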
Added the above details for each component to the PR description, along with details of what has been tested.
The unit test failure is unrelated to the changes in this PR.
This PR introduces end-to-end pipeline and model changes to support the LTX-2.3 multi-modal (audio-video) transformer model. It enables integrated text-to-audio-video generation using Gemma-based text conditioning, latent upsamplers, and vocoders.
Key architectural changes
- Gated attention (`to_gate_logits`) applied to all attention operations in the block (Self-Video, Self-Audio, Prompt-Cross, and Modal-Cross).
- Prompt AdaLN (`self.prompt_adaln`): for this specific cross-attention modulation, it derives scale and shift parameters directly from the continuous noise level (sigma).
- Per-modality projections (`per_modality_projections=True`): instead of a shared feature extractor, it applies per-token RMS normalization to the raw hidden states and passes them through two separate linear projection layers (`video_text_proj_in` and `audio_text_proj_in`) before sending them to the respective video and audio connectors.
- Bandwidth Extension (BWE) vocoder (`LTX2VocoderWithBWE`).

Files added/modified
- `ltx2_3_video.yml`: New config file for LTX2.3
- `vocoder_ltx2.py`: Added support for BWE vocoder
- `ltx2_pipeline.py`: Enabled 4-way sliced batched inference (Uncond, Cond, Perturb, Isolated) and integrated velocity/x0 conversion delta equations with guidance rescaling.
- `transformer_ltx2.py`: Propagated modality/perturbation masks to transformer blocks and integrated prompt adaptive layer norms.
- `generate_ltx2.py`, `pyconfig.py`, `common_types.py`: Added support for LTX2.3
- `ltx2_utils.py`: Added support to load new LTX2.3-specific weights
- `attention_ltx2.py`: Added support for gated attention and perturbed attention
- `autoencoder_kl_ltx2.py`: Added support for different `upsample_type`
- `embeddings_connector_ltx2.py`: Added gated attention (`gated_attn`) support to intermediate transformer block connectors.
- `feature_extractor_ltx2.py`: Added support for the `per_modality_projections` parameter
- `text_encoders.py`: Implemented dual-modality parallel text connector routing, token-wise RMS scaling, and independent video-audio linear projections.

Sample outputs
Component-wise breakdown
Tested:
- `scan_diffusion_loop = True` and `scan_diffusion_loop = False`
- `scan_layers = True` and `scan_layers = False`
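For context on the two tested modes, here is a toy sketch of how a `scan_diffusion_loop` toggle between `jax.lax.scan` and an unrolled Python loop can be structured. The `denoise_step` below is a hypothetical stand-in for the real transformer call, not this PR's implementation:

```python
import jax
import jax.numpy as jnp


def denoise_step(latents, sigma):
    """Toy denoising step; the real step would call the transformer."""
    return latents * (1.0 - 0.1 * sigma), None


def run_denoising(latents, sigmas, scan_diffusion_loop=True):
    """Run the diffusion loop either via jax.lax.scan (compiled as one
    loop primitive, faster to trace) or an unrolled Python loop (easier
    to debug). Both paths should produce the same result."""
    if scan_diffusion_loop:
        latents, _ = jax.lax.scan(denoise_step, latents, sigmas)
    else:
        for sigma in sigmas:
            latents, _ = denoise_step(latents, sigma)
    return latents


x = jnp.ones((2, 4))
sigmas = jnp.linspace(1.0, 0.0, 5)
a = run_denoising(x, sigmas, scan_diffusion_loop=True)
b = run_denoising(x, sigmas, scan_diffusion_loop=False)
print(jnp.allclose(a, b))
```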